iT邦幫忙

2023 iThome Ironman Contest (鐵人賽)

DAY 14
AI & Data

AI in Plain Language series: "A! Give Me That Image!", Part 14

CNN Classic Papers in Practice (1): LeNet and AlexNet


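Preface

Yesterday we told some history, mainly introducing the two ancestors of the CNN world: LeNet and AlexNet. One goal was to show that CNNs have been around longer than you might expect; the other was to use them to explain how mainstream CNN architectures are designed. Today we return to implementation so you can get a hands-on feel for how these two models work.

Prerequisites

  1. Python (you should at least be familiar with Python syntax)
  2. Object-oriented programming (you should at least know concepts such as class and function)
  3. Key features of the LeNet-5 and AlexNet architectures (review: https://ithelp.ithome.com.tw/articles/10330192 )
  4. The convolution operation (review: https://ithelp.ithome.com.tw/articles/10323076 )
  5. Convolutional neural networks (review: https://ithelp.ithome.com.tw/articles/10323077 )

After reading today's post, you might know...

  1. How to build the LeNet-5 model
  2. How to handle the special structure inside LeNet-5
  3. How to build the AlexNet model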

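I. LeNet-5 in Practice with PyTorch

  • Yesterday we only briefly covered the historical background of LeNet-5; today we go through each detail alongside the code.

    1. LeNet-5 model architecture

    • https://ithelp.ithome.com.tw/upload/images/20230929/20163299Urq3pbXrRw.jpg
    • As mentioned yesterday, the connection between the second and third layers of LeNet-5 (S2 and C3) differs from today's mainstream approach. Not every kernel operates on all input feature maps; instead, each kernel is connected to specific maps according to the table below: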
    • https://ithelp.ithome.com.tw/upload/images/20230929/20163299O65RwywDYF.jpg
    import torch
    import torch.nn as nn
    
    class LeNet5(nn.Module):
        def __init__(self):
            super(LeNet5, self).__init__()
            # Layer C1: Convolutional layer
            self.c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
            # Layer S2: Sub-sampling layer (Max-Pooling)
            self.s2 = nn.MaxPool2d(kernel_size=2, stride=2)
            # Layer C3: Convolutional layer with special connections
            self.c3_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_3 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_4 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_5 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_6 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_7 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_8 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_9 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_10 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_11 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_12 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_13 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_14 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_15 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_16 = nn.Conv2d(in_channels=6, out_channels=1, kernel_size=5)
    
            # Layer S4: Sub-sampling layer (Max-Pooling)
            self.s4 = nn.MaxPool2d(kernel_size=2, stride=2)
            # Layer C5: Fully connected layer
            self.c5 = nn.Linear(5 * 5 * 16, 120)
            # Layer F6: Fully connected layer
            self.f6 = nn.Linear(120, 84)
            # Output layer
            self.output = nn.Linear(84, 10)
    
        def forward(self, x):
            # Layer C1: Convolutional layer
            x = torch.relu(self.c1(x))
            # Layer S2: Sub-sampling layer
            x = self.s2(x)
            # Layer C3: Convolutional layer with special connections
            x1 = torch.relu(self.c3_1(x[:,:3,:,:]))
            x2 = torch.relu(self.c3_2(x[:,1:4,:,:]))
            x3 = torch.relu(self.c3_3(x[:,2:5,:,:]))
            x4 = torch.relu(self.c3_4(x[:,3:6,:,:]))
            x5 = torch.relu(self.c3_5(torch.cat((x[:,:1,:,:], x[:,4:6,:,:]), dim=1)))
            x6 = torch.relu(self.c3_6(torch.cat((x[:,:2,:,:], x[:,5:6,:,:]), dim=1)))
            x7 = torch.relu(self.c3_7(x[:,0:4,:,:]))
            x8 = torch.relu(self.c3_8(x[:,1:5,:,:]))
            x9 = torch.relu(self.c3_9(x[:,2:6,:,:]))
            x10 = torch.relu(self.c3_10(torch.cat((x[:,:1,:,:], x[:,3:6,:,:]), dim=1)))
            x11 = torch.relu(self.c3_11(torch.cat((x[:,:2,:,:], x[:,4:6,:,:]), dim=1)))
            x12 = torch.relu(self.c3_12(torch.cat((x[:,:3,:,:], x[:,5:6,:,:]), dim=1)))
            x13 = torch.relu(self.c3_13(torch.cat((x[:,:2,:,:], x[:,3:5,:,:]), dim=1)))
            x14 = torch.relu(self.c3_14(torch.cat((x[:,1:3,:,:], x[:,4:6,:,:]), dim=1)))
            x15 = torch.relu(self.c3_15(torch.cat((x[:,:1,:,:], x[:,2:4,:,:], x[:,5:6,:,:]), dim=1)))
            x16 = torch.relu(self.c3_16(x))
            x = torch.cat((x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16), dim=1)
            # Layer S4: Sub-sampling layer
            x = self.s4(x)
            # Flatten the feature maps for fully connected layers
            x = x.view(x.size(0), -1)
            # Layer C5: Fully connected layer
            x = torch.relu(self.c5(x))
            # Layer F6: Fully connected layer
            x = torch.relu(self.f6(x))
            # Output layer
            x = self.output(x)
            return x
    
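    • To realize this architecture, we build 16 independent convolution kernels between the output of S2 and the input of C3, each processing specific channels of the input according to the table above. From this combination we can compute the total number of learnable parameters between S2 and C3: (5*5*3+1)*6+(5*5*4+1)*9+(5*5*6+1)*1=1516
    • Note: the number of parameters in a single kernel is the number of elements it contains. With C input channels and a kernel size of KxK, one kernel has (KxKxC+1) parameters. Above we used 6 kernels with 3 input channels, 9 kernels with 4 input channels, and 1 kernel with 6 input channels, which is why the total is 1516. The constant 1 in the formula is there because each kernel usually gets one more learnable parameter, the bias, which is added to that kernel's output.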
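
As a quick check of the arithmetic above, the following sketch recomputes the S2-to-C3 parameter count in plain Python. The `C3_CONNECTIONS` list and the `conv_params` helper are names introduced here for illustration only; the table entries mirror the channel slices used in `forward`.

```python
# Which S2 feature maps (0-5) feed each of the 16 C3 feature maps.
# These tuples mirror the channel slices used in LeNet5.forward above.
C3_CONNECTIONS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 4, 5), (0, 1, 5),
    (0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (0, 3, 4, 5),
    (0, 1, 4, 5), (0, 1, 2, 5), (0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5),
    (0, 1, 2, 3, 4, 5),
]


def conv_params(in_channels, kernel_size=5):
    """Weights of one K x K kernel over `in_channels` maps, plus one bias."""
    return kernel_size * kernel_size * in_channels + 1


partial_total = sum(conv_params(len(maps)) for maps in C3_CONNECTIONS)
# For comparison: a conventional layer connects every C3 map to all 6 S2 maps.
full_total = 16 * conv_params(6)

print(partial_total)  # 1516
print(full_total)     # 2416
```

Under these assumptions, the partial connection scheme needs roughly 63% of the parameters that a fully connected C3 layer of the same shape would need.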

    2. Training procedure

    • Now that the model architecture is in place, let's try it on the MNIST handwritten digit dataset.
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    
    batch_size = 64
    learning_rate = 0.001
    num_epochs = 10
    
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
    
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    
    test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
    
    model = LeNet5()
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    total_step = len(train_loader)
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
    
            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')
    
    print('Training finished.')
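
Two numbers in the script above are easy to verify by hand. The sketch below, in plain Python, shows what `Normalize((0.5,), (0.5,))` does to a pixel value and how many steps one epoch takes. It assumes the standard MNIST training split of 60,000 images; the helper name `normalize` is used only for illustration.

```python
import math


def normalize(pixel, mean=0.5, std=0.5):
    """Same arithmetic as transforms.Normalize: (x - mean) / std."""
    return (pixel - mean) / std


# ToTensor() scales pixels to [0, 1]; Normalize then maps that range to [-1, 1].
print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0

train_images = 60_000  # assumed size of the MNIST training split
batch_size = 64
steps_per_epoch = math.ceil(train_images / batch_size)
print(steps_per_epoch)  # 938
```

With a log line every 100 steps, this would give nine progress messages per epoch.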
    

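    3. Evaluating model performance

    • Once the model is trained, we can use the test dataset to judge how well training went (test data differs from training data: it is new data the model has never seen).
    • Two things matter when testing a model: evaluation mode and disabling gradient computation. The former is model.eval(), which switches layers such as dropout and batch normalization to their inference behavior; the latter is with torch.no_grad():, which tells PyTorch not to track gradients. Together they signal that we are only testing, not training, so no gradients are computed and no parameters are updated. A side benefit is that skipping these steps makes the whole process a bit faster.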
    model.eval()  # Set model to evaluation mode
    
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f'Test Accuracy: {100 * correct / total}%')
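
To make the two switches concrete, here is a small sketch using a standalone `nn.Dropout` layer. The LeNet-5 above contains no dropout, so this layer is purely illustrative. The sketch suggests that `eval()` changes how certain layers behave, while `torch.no_grad()` stops gradient tracking.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)  # illustrative layer, not part of LeNet5
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # entries are either zeroed or scaled to 1 / (1 - p) = 2.0

drop.eval()
y_eval = drop(x)   # dropout is disabled: output equals input

w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    out = (w * 2).sum()  # computed without building an autograd graph

print(torch.equal(y_eval, x))  # True
print(out.requires_grad)       # False
```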
    

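    4. Complete code

    • Note that the activation function, down-sampling layers, and output layer used in this implementation are not the ones used in the original paper. Feel free to swap them out yourself; the results will differ. Today's content focuses only on the part of LeNet-5 that most often causes confusion: the connections between S2 and C3, and how to count the parameters.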
    
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    
    class LeNet5(nn.Module):
        def __init__(self):
            super(LeNet5, self).__init__()
            # Layer C1: Convolutional layer
            self.c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
            # Layer S2: Sub-sampling layer (Max-Pooling)
            self.s2 = nn.MaxPool2d(kernel_size=2, stride=2)
            # Layer C3: Convolutional layer with special connections
            self.c3_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_3 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_4 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_5 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_6 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
            self.c3_7 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_8 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_9 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_10 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_11 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_12 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_13 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_14 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_15 = nn.Conv2d(in_channels=4, out_channels=1, kernel_size=5)
            self.c3_16 = nn.Conv2d(in_channels=6, out_channels=1, kernel_size=5)
    
            # Layer S4: Sub-sampling layer (Max-Pooling)
            self.s4 = nn.MaxPool2d(kernel_size=2, stride=2)
            # Layer C5: Fully connected layer
            self.c5 = nn.Linear(5 * 5 * 16, 120)
            # Layer F6: Fully connected layer
            self.f6 = nn.Linear(120, 84)
            # Output layer
            self.output = nn.Linear(84, 10)
    
        def forward(self, x):
            # Layer C1: Convolutional layer
            x = torch.relu(self.c1(x))
            # Layer S2: Sub-sampling layer
            x = self.s2(x)
            # Layer C3: Convolutional layer with special connections
            x1 = torch.relu(self.c3_1(x[:,:3,:,:]))
            x2 = torch.relu(self.c3_2(x[:,1:4,:,:]))
            x3 = torch.relu(self.c3_3(x[:,2:5,:,:]))
            x4 = torch.relu(self.c3_4(x[:,3:6,:,:]))
            x5 = torch.relu(self.c3_5(torch.cat((x[:,:1,:,:], x[:,4:6,:,:]), dim=1)))
            x6 = torch.relu(self.c3_6(torch.cat((x[:,:2,:,:], x[:,5:6,:,:]), dim=1)))
            x7 = torch.relu(self.c3_7(x[:,0:4,:,:]))
            x8 = torch.relu(self.c3_8(x[:,1:5,:,:]))
            x9 = torch.relu(self.c3_9(x[:,2:6,:,:]))
            x10 = torch.relu(self.c3_10(torch.cat((x[:,:1,:,:], x[:,3:6,:,:]), dim=1)))
            x11 = torch.relu(self.c3_11(torch.cat((x[:,:2,:,:], x[:,4:6,:,:]), dim=1)))
            x12 = torch.relu(self.c3_12(torch.cat((x[:,:3,:,:], x[:,5:6,:,:]), dim=1)))
            x13 = torch.relu(self.c3_13(torch.cat((x[:,:2,:,:], x[:,3:5,:,:]), dim=1)))
            x14 = torch.relu(self.c3_14(torch.cat((x[:,1:3,:,:], x[:,4:6,:,:]), dim=1)))
            x15 = torch.relu(self.c3_15(torch.cat((x[:,:1,:,:], x[:,2:4,:,:], x[:,5:6,:,:]), dim=1)))
            x16 = torch.relu(self.c3_16(x))
            x = torch.cat((x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16), dim=1)
            # Layer S4: Sub-sampling layer
            x = self.s4(x)
            # Flatten the feature maps for fully connected layers
            x = x.view(x.size(0), -1)
            # Layer C5: Fully connected layer
            x = torch.relu(self.c5(x))
            # Layer F6: Fully connected layer
            x = torch.relu(self.f6(x))
            # Output layer
            x = self.output(x)
            return x
    
    
    batch_size = 64
    learning_rate = 0.001
    num_epochs = 10
    
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
    
    train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    
    test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
    
    model = LeNet5()
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    total_step = len(train_loader)
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
    
            # Backward pass and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    
            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')
    
    print('Training finished.')
    
    model.eval()  # Set model to evaluation mode
    
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    print(f'Test Accuracy: {100 * correct / total}%')
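
The sixteen hand-written C3 layers in the listing above could also be generated from the connection table. The following is only a sketch of that idea, not the implementation used in this article: `PartialConv` and `C3_CONNECTIONS` are hypothetical names, and the module assumes the same setting as C3 (6 input maps, 5x5 kernels, no padding).

```python
import torch
import torch.nn as nn

# S2 maps feeding each C3 map; mirrors the slices in LeNet5.forward.
C3_CONNECTIONS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 4, 5), (0, 1, 5),
    (0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (0, 3, 4, 5),
    (0, 1, 4, 5), (0, 1, 2, 5), (0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5),
    (0, 1, 2, 3, 4, 5),
]


class PartialConv(nn.Module):
    """One single-output Conv2d per entry of the connection table."""

    def __init__(self, connections, kernel_size=5):
        super().__init__()
        self.connections = [list(maps) for maps in connections]
        self.convs = nn.ModuleList(
            [nn.Conv2d(len(maps), 1, kernel_size) for maps in self.connections]
        )

    def forward(self, x):
        # Indexing dim 1 with a list selects the listed channels for each kernel.
        outs = [conv(x[:, maps]) for conv, maps in zip(self.convs, self.connections)]
        return torch.cat(outs, dim=1)


c3 = PartialConv(C3_CONNECTIONS)
y = c3(torch.randn(2, 6, 14, 14))  # a dummy batch shaped like the S2 output
n_params = sum(p.numel() for p in c3.parameters())

print(y.shape)   # torch.Size([2, 16, 10, 10])
print(n_params)  # 1516
```

Because ReLU is applied element by element, a single `torch.relu` on the concatenated output would give the same result as applying it to each of the sixteen outputs separately.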
    

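II. AlexNet in Practice with PyTorch

  • Following the same flow as LeNet-5: yesterday we covered the historical background, and today we use code to look at how each part of AlexNet is designed.

    1. AlexNet model architecture

    • AlexNet has eight learned layers in total, five convolutional and three fully connected, arranged as: convolution, down-sampling, convolution, down-sampling, convolution, convolution, convolution, down-sampling, fully connected, fully connected, fully connected. This is essentially the same as today's mainstream CNN architectures. Unlike LeNet-5, it has no special structure that needs extra care, so it is easier to implement.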
    import torch.nn as nn
    
    # Define the AlexNet model
    class AlexNet(nn.Module):
        def __init__(self, num_classes=10):
            super(AlexNet, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(64, 192, kernel_size=5, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(192, 384, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
            self.classifier = nn.Sequential(
                nn.Dropout(),
                nn.Linear(256 * 6 * 6, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Linear(4096, 4096),
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )
    
        def forward(self, x):
            x = self.features(x)
            x = self.avgpool(x)
            x = x.view(x.size(0), 256 * 6 * 6)
            x = self.classifier(x)
            return x
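
The `256 * 6 * 6` input size of the first fully connected layer follows from the spatial size of the feature maps leaving `features`. The sketch below traces that size in plain Python, assuming a 224x224 input. `out_size` and `FEATURE_LAYERS` are illustrative names, and the formula is the usual floor((n + 2p - k) / s) + 1.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output width of a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for each layer in AlexNet.features.
# ReLU layers are omitted because they do not change the spatial size.
FEATURE_LAYERS = [
    ("conv1", 11, 4, 2), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224  # assumed input width and height
sizes = []
for name, kernel, stride, padding in FEATURE_LAYERS:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name}: {size}x{size}")

flat_features = 256 * size * size
print(flat_features)  # 9216
```

With a 224x224 input the feature maps are already 6x6 when they reach `AdaptiveAvgPool2d((6, 6))`, so that layer leaves them unchanged in this setting.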
    

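    2. Complete code

    • The original AlexNet was trained on ImageNet, a large-scale dataset, but that takes a long time, so we train on a comparatively small dataset instead: CIFAR10. Like MNIST, CIFAR10 is a 10-class classification task. The difference is that MNIST consists of 28x28 grayscale handwritten digits, whereas CIFAR10 consists of 32x32 color images of objects such as cats, dogs, cars, and airplanes.
    • Since training and evaluation work the same way as above, we won't discuss them again here and simply provide the complete code for reference: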
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    
    # Define the AlexNet model
    class AlexNet(nn.Module):
        def __init__(self, num_classes=10):
            super(AlexNet, self).__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(64, 192, kernel_size=5, padding=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(192, 384, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
            self.classifier = nn.Sequential(
                nn.Dropout(),
                nn.Linear(256 * 6 * 6, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Linear(4096, 4096),
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )
    
        def forward(self, x):
            x = self.features(x)
            x = self.avgpool(x)
            x = x.view(x.size(0), 256 * 6 * 6)
            x = self.classifier(x)
            return x
    
    # Hyperparameters
    batch_size = 64
    learning_rate = 0.001
    num_epochs = 10
    
    # Data preprocessing and loading
    transform = transforms.Compose([transforms.Resize((224,224)),
                                    transforms.ToTensor(),
                                    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    
    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, transform=transform, download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    
    test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
    
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
    # Initialize the AlexNet model
    model = AlexNet(num_classes=10).to(device)
    
    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    # Training loop
    total_step = len(train_loader)
    for epoch in range(num_epochs):
        model.train()
        for i, (images, labels) in enumerate(train_loader):
            outputs = model(images.to(device))
            loss = criterion(outputs, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')
    
    # Evaluation
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
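        for images, labels in test_loader:
            # Move the batch to the same device as the model before the forward pass
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()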
    
    print(f'Test Accuracy: {100 * correct / total}%')
    
    
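    • One thing differs from the LeNet-5 code: this time we use device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu"), so that the GPU is used for acceleration whenever a supported one is available. Both the model and each batch of data have to be moved to that device with .to(device). This is also the biggest difference between AlexNet and LeNet-5: using GPUs to accelerate training.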

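III. Summary

  • Today's post fills in the architectural details we skipped yesterday, and building the models ourselves gives a real feel for how they work. Future paper walkthroughs will follow the same format, so that besides hearing the stories you also build up hands-on experience.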
